PCA algorithm for dimensionality reduction

Table of contents¶

  1. Introduction
  2. Principal component analysis (PCA) Algorithm
  3. Comparison with the Sklearn PCA object
  4. Conclusion

Introduction ¶

Visualization is a powerful tool for data exploration, but in the general case the input dimension is high, which makes visualization a hard task. Dimensionality reduction is a family of methods that reduce the data dimension for visualization and other purposes.


The goal of this notebook is to understand the PCA algorithm for dimensionality reduction.

Excerpt from Wikipedia, "Dimensionality reduction": Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable (hard to control or deal with). Dimensionality reduction is common in fields that deal with large numbers of observations and/or large numbers of variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.
Wiki link

Libraries¶

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
In [2]:
# Save Plotly figures in interactive mode in the HTML file
import plotly
plotly.offline.init_notebook_mode()

Principal component analysis (PCA) Algorithm ¶

Input data¶

Load the glove dictionary¶

For more details about this dictionary, please see the project Sentiment analysis with Naive Bayes vs LSTM Keras model

In [3]:
file=r'G:\Mon Drive\Personnel\05_Python_html_ext_code\08_AI_&_data science\Sentiment_analysis\Glove.npz'
In [4]:
loaded = np.load(file,allow_pickle=True)
Glove=loaded['Glove'].tolist()
Create a list of words and categories¶
In [5]:
words=['car','bus','train', 'woman','man','child','france','italy','germany']
category=['transport','transport','transport','human','human','human','country','country','country']
len(words),len(category)
Out[5]:
(9, 9)
Create an array from word embeddings using Glove dictionary¶
In [6]:
X=[]
for w in words: 
    X.append(Glove[w].tolist())
X=np.array(X)
In [7]:
X.shape
Out[7]:
(9, 50)
Normalization of the input array¶
In [8]:
Xn=(X-X.mean(axis=0))/X.std(axis=0)
In [9]:
Xn.shape
Out[9]:
(9, 50)
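The normalization above is a per-column z-score: subtract each column's mean and divide by its standard deviation. A minimal sketch on a toy array (illustrative only, not the notebook's data):

```python
import numpy as np

A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# z-score normalization, column by column
An = (A - A.mean(axis=0)) / A.std(axis=0)

# each column now has zero mean and unit standard deviation
print(An.mean(axis=0))  # ~[0. 0.]
print(An.std(axis=0))   # [1. 1.]
```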

*The input data has 50 columns, so it is hard to plot them all. The solution is to use the PCA algorithm to reduce the dimension from 50 down to only 2.*

Covariance of the input array¶

In [10]:
COV=np.cov(Xn, rowvar=False)
COV.shape
Out[10]:
(50, 50)
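`np.cov` with `rowvar=False` treats each column as a variable, so an input of shape (m, n) yields an n × n covariance matrix. A small sketch:

```python
import numpy as np

B = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 1.0],
              [3.0, 6.0, 0.0]])

# columns are variables, rows are observations
C = np.cov(B, rowvar=False)
print(C.shape)  # (3, 3)

# the first two columns are perfectly correlated (col2 = 2 * col1),
# so their covariance is exactly 2x the variance of the first column
print(C[0, 1])  # equals 2 * C[0, 0]
```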

Eigendecomposition of the covariance matrix¶

EigenValues and EigenVectors of the covariance matrix¶
In [11]:
EigenVals, EigenVecs = np.linalg.eigh(COV)
In [12]:
EigenVecs.shape
Out[12]:
(50, 50)
In [13]:
EigenVals
Out[13]:
array([-2.88818316e-15, -2.53898779e-15, -2.38941499e-15, -2.27499066e-15,
       -2.23097527e-15, -1.90785440e-15, -1.74317347e-15, -1.72006427e-15,
       -1.21593304e-15, -1.17649979e-15, -9.99778220e-16, -8.60631073e-16,
       -7.56937735e-16, -6.74129019e-16, -5.63288864e-16, -4.45314686e-16,
       -4.19832605e-16, -3.93899035e-16, -3.22564427e-16, -1.87234899e-16,
        3.61433213e-18,  5.69336912e-17,  1.52957220e-16,  2.90611120e-16,
        4.07079914e-16,  4.32576628e-16,  4.96066786e-16,  5.53562087e-16,
        7.22395715e-16,  7.33001616e-16,  8.55283910e-16,  9.73504182e-16,
        1.19645835e-15,  1.32362356e-15,  1.70395447e-15,  1.74594450e-15,
        2.14249559e-15,  2.22708761e-15,  2.61559632e-15,  2.89292472e-15,
        3.36249024e-15,  4.82687959e-15,  1.10442638e+00,  2.05198275e+00,
        3.37678465e+00,  3.63730471e+00,  5.36501108e+00,  6.01310240e+00,
        1.42403286e+01,  2.04610594e+01])
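`np.linalg.eigh` is the eigensolver for symmetric (Hermitian) matrices, which a covariance matrix always is; each returned pair satisfies COV · v = λ v, and the eigenvalues come back in ascending order, which is why a descending sort follows. A quick check on a toy symmetric matrix:

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # symmetric

vals, vecs = np.linalg.eigh(M)

# verify M @ v_i = lambda_i * v_i for every eigenpair
for i in range(len(vals)):
    assert np.allclose(M @ vecs[:, i], vals[i] * vecs[:, i])

print(vals)  # ascending order: [1. 3.]
```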
Sort EigenValues and EigenVectors¶
In [14]:
# Sort the eigenValues: Descending
index=np.argsort(EigenVals)[::-1]
index
Out[14]:
array([49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33,
       32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
       15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0],
      dtype=int64)
In [15]:
# EigenValues sorting
EigenVals=EigenVals[index]
EigenVals
Out[15]:
array([ 2.04610594e+01,  1.42403286e+01,  6.01310240e+00,  5.36501108e+00,
        3.63730471e+00,  3.37678465e+00,  2.05198275e+00,  1.10442638e+00,
        4.82687959e-15,  3.36249024e-15,  2.89292472e-15,  2.61559632e-15,
        2.22708761e-15,  2.14249559e-15,  1.74594450e-15,  1.70395447e-15,
        1.32362356e-15,  1.19645835e-15,  9.73504182e-16,  8.55283910e-16,
        7.33001616e-16,  7.22395715e-16,  5.53562087e-16,  4.96066786e-16,
        4.32576628e-16,  4.07079914e-16,  2.90611120e-16,  1.52957220e-16,
        5.69336912e-17,  3.61433213e-18, -1.87234899e-16, -3.22564427e-16,
       -3.93899035e-16, -4.19832605e-16, -4.45314686e-16, -5.63288864e-16,
       -6.74129019e-16, -7.56937735e-16, -8.60631073e-16, -9.99778220e-16,
       -1.17649979e-15, -1.21593304e-15, -1.72006427e-15, -1.74317347e-15,
       -1.90785440e-15, -2.23097527e-15, -2.27499066e-15, -2.38941499e-15,
       -2.53898779e-15, -2.88818316e-15])
In [16]:
# EigenVectors sorting
EigenVecs=EigenVecs[:,index]
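With the eigenvalues sorted (and the tiny negative ones understood as numerical zeros), a common sanity check, not performed in this notebook, is the explained-variance ratio. A sketch using the notebook's eight non-negligible eigenvalues, rounded:

```python
import numpy as np

# the eight non-negligible eigenvalues from the sorted output above, rounded;
# the remaining ~1e-15 values are numerical noise and are dropped
eigen_vals = np.array([20.46, 14.24, 6.01, 5.37, 3.64, 3.38, 2.05, 1.10])

# fraction of the total variance carried by each component
ratios = eigen_vals / eigen_vals.sum()

# variance kept by the first 2 components
print(ratios[:2].sum())  # ~0.617
```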
The matrix "S": a diagonal matrix of EigenValues¶

This matrix is not used in the current project

In [17]:
S=np.diag(EigenVals)
S.shape
Out[17]:
(50, 50)
Define the desired dimension of reduced data X¶
In [18]:
Out_dim=2
Select the first n=Out_dim sorted EigenVectors¶
In [19]:
Sub_EigenVecs=EigenVecs[:,:Out_dim]
Sub_EigenVecs.shape
Out[19]:
(50, 2)

Reminder of the input shape

In [20]:
X.T.shape
Out[20]:
(50, 9)

Calculate the reduced input data¶

In [21]:
Xr=Sub_EigenVecs.T.dot(X.T).T
Xr.shape
Out[21]:
(9, 2)
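The expression `Sub_EigenVecs.T.dot(X.T).T` is algebraically the same as the more direct `X.dot(Sub_EigenVecs)`, since (AB)ᵀ = BᵀAᵀ. A quick sketch with random stand-ins matching the notebook's shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(9, 50))  # stand-in for the embedding matrix
U_demo = rng.normal(size=(50, 2))  # stand-in for the top-2 eigenvectors

a = U_demo.T.dot(X_demo.T).T  # the notebook's formulation
b = X_demo.dot(U_demo)        # equivalent direct projection

assert np.allclose(a, b)
print(a.shape)  # (9, 2)
```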

Reminder of the input shape

In [22]:
X.shape
Out[22]:
(9, 50)
Organize the data in a dataframe¶
In [23]:
df=pd.DataFrame(Xr,columns=['x1','x2'])
df['word']=words
df['category']=category
df
Out[23]:
x1 x2 word category
0 0.823152 2.500449 car transport
1 0.960812 3.640771 bus transport
2 0.436880 2.558087 train transport
3 2.452261 -1.506051 woman human
4 2.094862 -1.170678 man human
5 2.403375 -1.911725 child human
6 -3.407146 -1.153402 france country
7 -3.175158 -0.729738 italy country
8 -3.332977 -1.022581 germany country
Plot the data¶
In [24]:
fig=px.scatter(df,x='x1',y='x2',color='category',text='word',width=800, height=600,
              title='The transformed data')
fig.show()

We can see that each category is grouped in a specific area of the figure

Principal component analysis in one function¶

In [25]:
def PCA (X,Out_dim=2,std_norm=False):
    # X.shape: mxn, m is the number of rows in the data, n is the number of columns (features)
    # Out_dim: the desired output dimension, 2 is good for visualization
    # std_norm: if True, also divide by the standard deviation when normalizing X
    
    # Normalization
    if std_norm:
        Xn=(X-X.mean(axis=0))/X.std(axis=0)
    else:
        Xn=(X-X.mean(axis=0))
    # Covariance 
    COV=np.cov(Xn, rowvar=False)
    #======== Eigendecomposition of the covariance matrix =======
    # EigenValues and EigenVectors
    EigenVals, EigenVecs = np.linalg.eigh(COV)
    # Sort the eigenValues: descending
    index=np.argsort(EigenVals)[::-1]
    # Use the index to reorder the eigenVectors the same way
    EigenVecs=EigenVecs[:,index]
    # Keep only the first Out_dim eigenVectors
    U=EigenVecs[:,:Out_dim]
    #============================================================
    # Compute the reduced X 
    Xr=U.T.dot(X.T).T

    return Xr
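To see how faithful the reduction is, one can project `Xr` back with the kept eigenvectors and compare against the (centered) original. The function above returns only `Xr`, so this sketch repeats the same steps on synthetic low-rank data; names like `X_demo` are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic low-rank data: 9 samples that truly live in a 2-D subspace of R^5
X_demo = rng.normal(size=(9, 2)).dot(rng.normal(size=(2, 5)))

# same steps as the PCA function above (std_norm=False)
Xc = X_demo - X_demo.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
U = vecs[:, np.argsort(vals)[::-1]][:, :2]

Xr_demo = Xc.dot(U)        # reduce to 2 dimensions
X_back = Xr_demo.dot(U.T)  # project back into the original 5-D space

# the data is rank 2, so 2 components reconstruct it exactly (after centering)
assert np.allclose(X_back, Xc)
```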
Dimension reduction with the defined PCA function¶
In [26]:
Xr2=PCA (X,Out_dim=2)
Organization of the data in a dataframe¶
In [27]:
df2=pd.DataFrame(Xr2,columns=['x1','x2'])
df2['word']=words
df2['category']=category
df2
Out[27]:
x1 x2 word category
0 1.833871 2.239016 car transport
1 2.182165 3.049211 bus transport
2 1.314006 2.543402 train transport
3 1.893575 -2.657787 woman human
4 1.778322 -2.000878 man human
5 1.787517 -2.536059 child human
6 -3.665965 0.056487 france country
7 -3.298916 0.007370 italy country
8 -3.535607 0.202911 germany country
Plot the data¶
In [28]:
fig=px.scatter(df2,x='x1',y='x2',color='category',text='word',width=800, height=600,
               title='The transformed data with the PCA function')
fig.show()

Comparison with the Sklearn PCA object ¶

In [29]:
from sklearn.decomposition import PCA as SklearnPCA

For more information see the link
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

PCA instance and fitting¶
In [30]:
pca = SklearnPCA(n_components=2)
pca.fit(X)
Out[30]:
PCA(n_components=2)
Dimension reduction¶
In [31]:
Xr_sklearn=pca.transform(X)
Organization of the data in a dataframe¶
In [32]:
dfs=pd.DataFrame(Xr_sklearn,columns=['x1','x2'])
dfs['word']=words
dfs['category']=category
dfs
Out[32]:
x1 x2 word category
0 -1.801764 2.138607 car transport
1 -2.150057 2.948803 bus transport
2 -1.281898 2.442994 train transport
3 -1.861468 -2.758195 woman human
4 -1.746214 -2.101286 man human
5 -1.755409 -2.636467 child human
6 3.698072 -0.043921 france country
7 3.331024 -0.093039 italy country
8 3.567715 0.102503 germany country
Plot the data¶
In [33]:
fig=px.scatter(dfs,x='x1',y='x2',color='category',text='word',width=800, height=600,
              title='The Sklearn transformed data')
fig.show()
Invert the x1 axis¶
In [34]:
dfs.x1*=-1
fig=px.scatter(dfs,x='x1',y='x2',color='category',text='word',width=800, height=600)
fig.show()

The local PCA function and the Sklearn PCA give essentially the same result: the points differ only by a sign flip of the x1 axis (eigenvectors are defined only up to sign) and a small constant offset, because the local function projects the uncentered X while Sklearn centers the data before projecting.
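Per-component sign flips like the one above are expected: eigenvectors (and singular vectors) are defined only up to sign, so two correct PCA implementations can disagree on the orientation of an axis. A sign-insensitive comparison sketch on random data (note that Sklearn centers internally, so the manual route here centers first as well):

```python
import numpy as np
from sklearn.decomposition import PCA as SklearnPCA

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(9, 5))

# manual route: center, eigendecompose the covariance, project
Xc = X_demo - X_demo.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
U = vecs[:, np.argsort(vals)[::-1]][:, :2]
Xr_manual = Xc.dot(U)

# sklearn route (centers the data internally)
Xr_sk = SklearnPCA(n_components=2).fit_transform(X_demo)

# identical up to a per-component sign flip
assert np.allclose(np.abs(Xr_manual), np.abs(Xr_sk))
```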

Transformed input: 3 dimensions instead of 2¶

Transform the data¶
In [35]:
Xr3=PCA (X,Out_dim=3)
In [36]:
Xr3.shape
Out[36]:
(9, 3)
Organize the data in a dataframe¶
In [37]:
df3=pd.DataFrame(Xr3,columns=['x1','x2','x3'])
df3['word']=words
df3['category']=category
df3
Out[37]:
x1 x2 x3 word category
0 1.833871 2.239016 1.986809 car transport
1 2.182165 3.049211 0.101166 bus transport
2 1.314006 2.543402 -0.456373 train transport
3 1.893575 -2.657787 0.990900 woman human
4 1.778322 -2.000878 2.066705 man human
5 1.787517 -2.536059 -1.404173 child human
6 -3.665965 0.056487 0.787283 france country
7 -3.298916 0.007370 0.628147 italy country
8 -3.535607 0.202911 0.335505 germany country
3D Plotting¶
In [38]:
fig=px.scatter_3d(df3, x='x1', y='x2', z='x3',color='category',
                  text='word',width=800, height=600,title='3D plotting of the transformed data')
In [39]:
fig.show()

Conclusion ¶

In this notebook we developed a PCA algorithm for dimensionality reduction using only the NumPy library, and we compared the result with the Sklearn PCA transformation.

Wikipedia links:¶


https://en.wikipedia.org/wiki/Dimensionality_reduction
https://en.wikipedia.org/wiki/Word2vec
https://en.wikipedia.org/wiki/Principal_component_analysis